4.15 Conclusions

a diagram (Figure 23) that summarizes my personal recommendations based on the concepts and literature that was reviewed.

"Figure 23 summarizes #rasbt's personal recommendations, based on the concepts and literature reviewed in this paper."

https://sebastianraschka.com/images/blog/2018/model-evaluation-selection-part4/model-eval-conclusions.jpg

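Red: generalization performance estimation (= model evaluation)

Large dataset

2-way holdout (train/test) (1.5 Holdout Validation)

Confidence interval via normal approximation (1.7 Confidence Intervals via Normal Approximation)

Small dataset

(Repeated) k-fold cross-validation without an independent test set (3.4 Introduction to k-fold Cross-Validation)

"(Repeated)" is in parentheses because of 3.5 Special Cases: 2-Fold and Leave-One-Out Cross-Validation → 4.5 Multiple Hypotheses Testing (the paper stresses that it should be taken with a grain of salt)

Leave-one-out cross-validation without an independent test set

Confidence interval via 0.632(+) bootstrap (2.4 The Bootstrap Method and Empirical Confidence Intervals)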
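A minimal sketch of the normal-approximation confidence interval recommended for large datasets, ACC ± z * sqrt(ACC * (1 - ACC) / n). The function name, the accuracy of 0.9, and the test-set size of 1000 are illustrative assumptions, not values from the paper.

```python
import math


def normal_approx_ci(acc, n, z=1.96):
    """Confidence interval for an accuracy `acc` measured on `n` independent
    test examples, using the normal approximation to the binomial.

    z=1.96 corresponds to a 95% confidence level.
    """
    standard_error = math.sqrt(acc * (1.0 - acc) / n)
    return acc - z * standard_error, acc + z * standard_error


# Hypothetical example: 90% accuracy on a 1000-example test set.
lower, upper = normal_approx_ci(acc=0.9, n=1000)
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
```

The interval narrows as n grows, which is one reason this approach suits the large-dataset branch of the figure.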

Green: model selection (hyperparameter optimization) and generalization performance estimation

Large dataset

3-way holdout (train/validation/test) (3.3 The Three-Way Holdout Method for Hyperparameter Tuning)

Small dataset

(Repeated) k-fold cross-validation with an independent test set (3.7 Model Selection via k-fold Cross-Validation)

Leave-one-out cross-validation with an independent test set (3.5 Special Cases: 2-Fold and Leave-One-Out Cross-Validation)
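A rough sketch of the three-way holdout split (train/validation/test) on index lists. The 60/20/20 proportions, the function name, and the seed are assumptions for illustration, and the split is not stratified by class label.

```python
import random


def three_way_holdout(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into three
    disjoint parts: training, validation, and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_holdout(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The validation part is used for comparing hyperparameter settings, and the test part is touched only once, for the final performance estimate.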

Blue: model and algorithm comparison

The abbreviation "MC" stands for "Model Comparison," and "AC" stands for "Algorithm Comparison"

Large dataset

(Algorithm comparison) Multiple independent training sets and test sets

(Model comparison) McNemar test (4.3 Comparing Two Models with the McNemar Test)

(Model comparison) Cochran's Q + McNemar test (4.6 Cochran’s Q Test for Comparing the Performance of Multiple Classifiers)

Small dataset

(Algorithm comparison) Combined 5x2cv F test (4.12 Alpaydin’s Combined 5x2cv F-test?)

(Algorithm comparison) Nested cross-validation (4.14 Nested Cross-Validation)
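A minimal sketch of the McNemar test for comparing two models on the same test set, using the continuity-corrected chi-squared statistic (|b - c| - 1)^2 / (b + c) with 1 degree of freedom. The function name and the toy predictions are hypothetical. When b + c is small, an exact binomial test is commonly advised instead of this approximation.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Return (chi2, p_value) for the continuity-corrected McNemar test.

    b counts examples model A gets right and model B gets wrong;
    c counts examples model A gets wrong and model B gets right.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, m in triples if a == t and m != t)
    c = sum(1 for t, a, m in triples if a != t and m == t)
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical toy data: 40 test examples, all with true label 1.
y_true = [1] * 40
pred_a = [1] * 25 + [0] * 15                        # A is right on the first 25
pred_b = [1] * 10 + [0] * 15 + [1] * 5 + [0] * 10   # gives b = 15, c = 5
chi2, p_value = mcnemar_test(y_true, pred_a, pred_b)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Only the discordant pairs (b and c) enter the statistic. Examples that both models get right or both get wrong do not affect the result.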

It should be stressed that parametric tests for comparing model performances usually violate one or more independence assumptions (the models are not independent because the same training set was used, and the estimated generalization performances are not independent because the same test set was used).

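"Parametric tests for comparing the generalization performance of models usually violate one or more independence assumptions."

"Because the same training set is used, the models are not independent."

"Because the same test set is used, the estimated generalization performances are not independent."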

However, in most practical applications, the size of the dataset is limited; hence, we can use one of the statistical tests discussed in this article as a heuristic to aid our decision making.

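"In most practical applications, the size of the dataset is limited."

"Hence we can use one of the statistical tests discussed in this paper as a heuristic that helps with decision making."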

the recommendations I listed in the figure above are suggestions and depend on the problem at hand.

"The recommendations listed in Figure 23 are suggestions and depend on the problem at hand."

using a single training and test set when only a few data records are available can be problematic for several reasons discussed throughout Section 2 and Section 3

"When only a few data records are available, using a single training set and test set can be problematic, for several reasons discussed throughout Sections 2 and 3."

If the dataset is very small, it might not be feasible to set aside data for testing, and in such cases, we can use k-fold cross-validation with a large k or Leave-one-out cross-validation as a workaround for evaluating the generalization performance.

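"When the dataset is very small, setting data aside for testing may not be feasible."

"In such cases, k-fold cross-validation with a large k, or leave-one-out cross-validation, can be used as a workaround for evaluating the generalization performance."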

However, using these procedures, we have to bear in mind that we then do not compare between models but different algorithms that produce different models on the training folds.

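"However, when using these procedures (= k-fold cross-validation on a small dataset), we have to keep in mind that we are not comparing models but different algorithms that produce different models on the training folds."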

Nonetheless, the average performance over the different test folds can serve as an estimate for the generalization performance (Section 3 discussed the various implications for the bias and the variance of this estimate as a function of the number of folds).

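"Nonetheless, the performance averaged over the different test folds can serve as an estimate of the generalization performance."

"3 Cross-validation and Hyperparameter Optimization discussed the various implications for the bias and variance of this estimate as a function of the number of folds."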
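A minimal sketch of estimating generalization performance as the average accuracy over the test folds, with leave-one-out cross-validation as the special case k = n. The helper names, the toy 1-nearest-neighbour learner, and the toy data are all assumptions for illustration. Note that a new model is fitted on every training fold, so the average describes the learning algorithm rather than one fixed model.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]


def cross_val_accuracy(X, y, fit, predict, k):
    """Fit a fresh model on each training fold and average the
    accuracies measured on the corresponding test folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return mean(scores)


# Toy learner (hypothetical): 1-nearest neighbour on a single feature.
def fit_1nn(X_train, y_train):
    return list(zip(X_train, y_train))


def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


X = [0.0, 0.1, 0.2, 0.3, 0.4, 1.0, 1.1, 1.2, 1.3, 1.4]
y = [0] * 5 + [1] * 5

acc_5fold = cross_val_accuracy(X, y, fit_1nn, predict_1nn, k=5)
acc_loocv = cross_val_accuracy(X, y, fit_1nn, predict_1nn, k=len(X))  # leave-one-out
print(acc_5fold, acc_loocv)
```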

For model comparisons, we usually do not have multiple independent test sets to evaluate the models on, so we can again resort to cross-validation procedures such as k-fold cross-validation, the 5x2cv method, or nested cross-validation.

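"For model comparisons, we usually do not have multiple independent test sets on which to evaluate the models."

"So we again resort to cross-validation procedures (k-fold cross-validation, 5x2cv, nested cross-validation)."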
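As a rough sketch of the 5x2cv idea, the combined 5x2cv F statistic can be computed from the ten performance differences between two algorithms (5 repetitions of 2-fold cross-validation): f = sum of squared differences / (2 * sum of per-repetition variances). The statistic is approximately F-distributed with 10 and 5 degrees of freedom, for which the 5% critical value is roughly 4.74. The function name and the difference values below are made up for illustration.

```python
def combined_5x2cv_f(diffs):
    """Combined 5x2cv F statistic.

    `diffs` holds 5 pairs (p1, p2): the performance difference between
    algorithm A and algorithm B on the two folds of each 2-fold repetition.
    """
    numerator = sum(p1 ** 2 + p2 ** 2 for p1, p2 in diffs)
    variance_sum = 0.0
    for p1, p2 in diffs:
        p_bar = (p1 + p2) / 2.0
        variance_sum += (p1 - p_bar) ** 2 + (p2 - p_bar) ** 2
    if variance_sum == 0.0:
        raise ValueError("all per-repetition variances are zero")
    return numerator / (2.0 * variance_sum)


# Hypothetical accuracy differences (A minus B), one pair per repetition.
diffs = [(0.02, 0.04), (0.03, 0.01), (0.05, 0.03), (0.02, 0.02), (0.04, 0.02)]
f_stat = combined_5x2cv_f(diffs)
print(f"f = {f_stat:.2f}")
```

Each repetition needs a fresh random 50/50 split of the data, with both algorithms trained and evaluated on the same two halves. That part is omitted here.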

(Picking up on cross-validation being mentioned again) the section closes by quoting Varoquaux 2017, Cross-validation failure: small sample sizes lead to large error bars:

Cross-validation is not a silver bullet. However, it is the best tool available, because it is the only non-parametric method to test for model generalization.

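"Cross-validation is not a silver bullet."

"However, it is the only non-parametric method that can test model generalization, and it is the best tool available."

TODO: What does it mean for cross-validation to be non-parametric? (Is it the same sense as in 3.2 About Hyperparameters and Model Selection, which says of kNN that it has "no parameters to train"?)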